Coding (and Research) Tips

Jacob Toner Gosselin

2026-01-20

Outline

First go through coding examples for two causal inference methods:

  1. Instrumental Variables (IV)
  2. Difference-in-Differences (DiD)

Then go through some:

  1. General Tips for Research and Writing

Instrumental Variables

IV: The DAG and Endogeneity Problem

  • Consider an RCT: random assignment R determines treatment X, so even if X is endogenous through W, we can identify X -> Y
  • Idea of IV: can we find variable Z that takes the place of R?

IV: The Solution

  • Consider standard linear model: \[ Y = \beta X + \varepsilon \]
  • Assume (1) relevance, \(E[X|Z] \neq 0\) (more precisely, \(E[X|Z]\) varies with \(Z\)), and (2) exogeneity, \(E[\varepsilon|Z] = 0\). Then \[ E[Y|Z] = \beta E[X|Z] + E[\varepsilon|Z] = \beta E[X|Z] \]
  • Mechanically, this corresponds to:
    1. Explain X with Z, and keep only what is explained, X'
    2. Explain Y with Z, and keep only what is explained, Y'
    3. Relate Y' to X'; the slope of that relationship is \(\beta\)
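
The mechanical steps above can be sketched on simulated data. This is a minimal illustration in base R, not the estimator used in the examples that follow. The data-generating process (true effect of 2, confounder W, instrument Z) and all variable names are assumptions made up for the sketch.

```r
# Simulated data (hypothetical): W is an unobserved confounder, Z is the instrument
set.seed(1)
n <- 10000
W <- rnorm(n)                      # unobserved confounder
Z <- rnorm(n)                      # instrument: moves X, unrelated to W
X <- 1.0 * Z + 1.0 * W + rnorm(n)  # treatment depends on Z and W
Y <- 2.0 * X + 3.0 * W + rnorm(n)  # outcome; true beta is 2 by construction

# Naive OLS is biased because W drives both X and Y
beta_ols <- unname(coef(lm(Y ~ X))["X"])

# Step 1: explain X with Z, keep only the explained part
X_hat <- fitted(lm(X ~ Z))
# Step 2: explain Y with Z, keep only the explained part
Y_hat <- fitted(lm(Y ~ Z))
# Step 3: relate the explained parts to each other
beta_iv <- unname(coef(lm(Y_hat ~ X_hat))["X_hat"])

# With a single instrument this equals the ratio of covariances
beta_ratio <- cov(Y, Z) / cov(X, Z)

print(c(ols = beta_ols, iv = beta_iv, ratio = beta_ratio))
```

In this setup the IV estimate should land near 2, while naive OLS is pulled away from it by the confounding through W.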

IV: The Solution (visualized)

IV: Estimation (2SLS)

Most commonly this is estimated using two stage least squares

  1. Use the instruments and controls to explain \(X\) in the first stage
  2. Use the controls and the predicted (explained) part of \(X\) in place of \(X\) in the second stage
  3. (do some standard error adjustments)
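
To see what the two stages and the standard error adjustment amount to, here is a minimal sketch of 2SLS "by hand" in base R. The simulated data, the control C, and the true effect of 1.5 are assumptions for illustration. The standard error shown is the classical (homoskedastic) 2SLS one, not the heteroskedasticity-robust version reported in the example below.

```r
# Simulated data (hypothetical): C is an observed control, W an unobserved confounder
set.seed(2)
n <- 20000
C <- rnorm(n)
W <- rnorm(n)
Z <- rnorm(n)                               # instrument
X <- 0.8 * Z + 0.5 * C + W + rnorm(n)
Y <- 1.5 * X + 1.0 * C + 2 * W + rnorm(n)   # true effect of X is 1.5

# Stage 1: explain X with the instrument and the controls
stage1 <- lm(X ~ Z + C)
X_hat  <- fitted(stage1)

# Stage 2: use the predicted part of X in place of X, keeping the controls
stage2    <- lm(Y ~ X_hat + C)
beta_2sls <- unname(coef(stage2)["X_hat"])

# Step 3: the standard errors printed by summary(stage2) are not valid,
# because its residuals are built from X_hat instead of the actual X
se_naive <- summary(stage2)$coefficients["X_hat", "Std. Error"]

# Classical 2SLS standard error: residuals use the actual X
resid_2sls <- Y - cbind(1, X, C) %*% coef(stage2)
sigma2     <- sum(resid_2sls^2) / (n - 3)
Xh         <- cbind(1, X_hat, C)
se_2sls    <- unname(sqrt(sigma2 * diag(solve(crossprod(Xh))))[2])

print(c(beta = beta_2sls, se_naive = se_naive, se_2sls = se_2sls))
```

Dedicated routines handle this adjustment automatically, which is one reason to prefer them over running the two regressions manually.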

There are many ways to do this in R; I’ll be doing 2SLS with feols() from fixest

Example 1: Macro Question!

  • How does US income affect US expenditures (“marginal propensity to consume”)?
  • We can instrument with investment from LAST year.
library(AER)     # provides the USConsump1993 data
library(dplyr)   # %>%, mutate(), lag()
library(fixest)  # feols()
# US income and consumption data, 1950-1993
data(USConsump1993)
USC93 <- as.data.frame(USConsump1993)
# lag() gets the observation above; here the observation above is last year
IV <- USC93 %>% mutate(lastyr.invest = lag(income) - lag(expenditure))
# 2SLS estimation: instrument income with last year's investment
m_iv <- feols(expenditure ~ 1 | income ~ lastyr.invest, data = IV, vcov = 'hetero')

Example 1: Macro Question!

                     Income (First Stage)        Expenditure
Income (2SLS)                                    0.892***
                                                 (0.009)
Lagged Investment    8.210***
                     (0.620)
Num.Obs.             43                          43
Std.Errors           Heteroskedasticity-robust   Heteroskedasticity-robust
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Example 1: Stata Replication

* Load the data
import delimited "data/usconsump1993.csv", clear
* generate lagged investment variable and time variable 
gen year = _n + 1949
tsset year
gen lastyr_invest = L.income - L.expenditure
* 2SLS estimation: instrument income with lagged investment
ivregress 2sls expenditure (income = lastyr_invest), vce(robust)

Difference-in-Differences

DiD: The DAG and Endogeneity Problem

  • We compare the time before the treatment to the time after
  • But if anything else is changing over time, we have a problem
  • Need a control group that is not treated

DiD: The Solution

  • Notation: T and C are the treated and control groups; B and A are before and after treatment
  • Before-After Difference for Control Group: \[ E[Y | C, A] - E[Y | C, B] = \text{Time} \]
  • Before-After Difference for Treated Group: \[ E[Y | T, A] - E[Y | T, B] = \text{Time} + \text{Trmt} \]
  • Difference-in-Differences: \[ (E[Y | T, A] - E[Y | T, B]) - (E[Y | C, A] - E[Y | C, B]) = \text{Trmt} \]
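
As a quick worked example of the arithmetic, take four cell means. The numbers below are invented purely for illustration and do not come from any dataset in these slides.

```r
# Hypothetical cell means E[Y | group, period]; values invented for illustration
means <- c(C_B = 10, C_A = 12,   # control group: before, after
           T_B = 11, T_A = 16)   # treated group: before, after

diff_control <- means[["C_A"]] - means[["C_B"]]   # Time
diff_treated <- means[["T_A"]] - means[["T_B"]]   # Time + Trmt
did          <- diff_treated - diff_control       # Trmt

print(c(control = diff_control, treated = diff_treated, did = did))
```

Here the control group's change of 2 stands in for what would have happened to the treated group without treatment, so the remaining 3 is attributed to the treatment.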

DiD: The Solution (visualized)

DiD: Estimation (TWFE)

  • Standard DiD estimation is two-way fixed effects (TWFE) regression \[ Y = \gamma_i + \gamma_t + \beta Treated + \varepsilon \]
  • Why this works is easy to see if we limit it to a “2x2” DiD \[ Y = \gamma_i TreatedGroup + \gamma_t After + \beta TreatedGroup\times After + \varepsilon \]
  • \(\gamma_i\) is the prior-period group difference, \(\gamma_t\) is the shared time effect, and \(\beta\) is how much bigger the \(TreatedGroup\) effect gets after treatment vs. before, i.e. how much the gap grows (Difference-in-Differences!)
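
A minimal sketch of the 2x2 case in base R, on simulated data, shows that the interaction coefficient is the same number as the difference-in-differences of the four cell means. The data-generating process (group gap 1, time effect 0.5, treatment effect 2) and the variable names are assumptions for illustration. Note that lm() also adds an intercept, the control group's pre-period mean, which the equation above leaves implicit.

```r
# Simulated 2x2 data (hypothetical values)
set.seed(4)
n <- 8000
treated_group <- rbinom(n, 1, 0.5)   # 1 = treated group, 0 = control group
after         <- rbinom(n, 1, 0.5)   # 1 = after treatment, 0 = before
y <- 1 * treated_group + 0.5 * after + 2 * treated_group * after + rnorm(n)

# Regression version: beta is the coefficient on the interaction
m <- lm(y ~ treated_group + after + treated_group:after)
beta_reg <- unname(coef(m)["treated_group:after"])

# Cell-means version: difference of the two before-after differences
cell_mean  <- function(g, a) mean(y[treated_group == g & after == a])
beta_means <- (cell_mean(1, 1) - cell_mean(1, 0)) -
              (cell_mean(0, 1) - cell_mean(0, 0))

print(c(regression = beta_reg, cell_means = beta_means))
```

The two numbers agree up to floating-point error because the 2x2 regression is saturated: it fits each of the four cell means exactly.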

Example 1

  • As a quick example we’ll use data(injury) from library(wooldridge)
  • This is from Meyer, Viscusi, and Durbin (1995): in Kentucky in 1980, the workers’ compensation law changed to increase benefits, but only for high-earning individuals
  • What effect did this have on how long you stay out of work?
  • The treated group is individuals who were already high-earning, and the control group is those who weren’t

Example 1

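library(dplyr)         # %>%, filter(), mutate()
library(fixest)        # feols()
library(modelsummary)  # msummary()
data(injury, package = 'wooldridge')
injury <- injury %>%
  filter(ky == 1)  %>% # Kentucky only
  mutate(Treated = afchnge*highearn) # treated group, after the law change
m1_did <- feols(ldurat ~ Treated | highearn + afchnge, data = injury)
msummary(m1_did, stars = TRUE, gof_omit = 'FE|RMSE|R2|AIC|BIC|Lik|Adj|Pseudo')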
             (1)
Treated      0.191**
             (0.069)
Num.Obs.     5626
Std.Errors   IID
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Example 1: Stata Replication

* Load the data
import delimited "data/injury_ky.csv", clear
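* TWFE DiD regression with fixed effects for group (highearn) and time (afchnge)
* (assumes the CSV already contains treated = afchnge*highearn; import delimited lowercases variable names by default)
reghdfe ldurat treated, absorb(highearn afchnge) vce(robust)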

Example 2: Dynamic DiD

  • We often estimate a dynamic effect, allowing the effect to differ with the length of time since treatment
  • Simply interact \(TreatedGroup\) with binary indicators for time period (last period before treatment is the reference) \[ Y = \gamma_i + \gamma_t + \beta_t TreatedGroup + \varepsilon \]
  • Typically plot the \(\beta_t\) coefficients to see how effect evolves over time
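
The interaction setup described above can be sketched in base R on simulated data. Everything here is hypothetical: the years, the effect sizes (zero through 1993, then growing), and the variable names are assumptions chosen so that 1993 plays the role of the last pre-treatment period.

```r
# Simulated data (hypothetical): years 1991-1996, treatment starts after 1993
set.seed(5)
years      <- 1991:1996
n_per_year <- 4000
df <- expand.grid(id = seq_len(n_per_year), year = years)
df$treated <- as.integer(df$id <= n_per_year / 2)   # half the units are in the treated group

# Assumed true effects by year: none before treatment, growing afterwards
true_effect <- c("1991" = 0, "1992" = 0, "1993" = 0,
                 "1994" = 0.5, "1995" = 1.0, "1996" = 1.5)
effect <- unname(true_effect[as.character(df$year)])
df$y <- 0.3 * df$treated + 0.1 * (df$year - 1991) +
        effect * df$treated + rnorm(nrow(df))

# Make the last pre-treatment year the reference level
df$year_f <- relevel(factor(df$year), ref = "1993")

# Group effect + year effects + group-by-year interactions (the beta_t)
m <- lm(y ~ treated + year_f + treated:year_f, data = df)
beta_t <- coef(m)[grep("^treated:year_f", names(coef(m)))]
print(round(beta_t, 2))
```

Each coefficient is measured relative to 1993, so pre-period estimates near zero are what we would hope to see before reading the post-period ones as effects.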

Example 2: Dynamic DiD

library(dplyr)
library(fixest)
library(ggplot2)
library(ggfixest) # provides ggcoefplot()
library(readr)
df <- read_csv('data/eitc.csv') %>%
  mutate(treated = 1*(children > 0)) %>%
  mutate(year = factor(year))
# assert that '1993' is a level of year
stopifnot('1993' %in% levels(df$year))
m <- feols(work ~ i(year, treated, ref = '1993') | treated + year, data = df)
coef_plot <- ggcoefplot(m, ref = c('1993' = 3), pt.join = TRUE) +
  labs(title = "Dynamic Difference-in-Differences Estimates of EITC on Work",
       x = "Year",
       y = "Coefficient Estimate (ref: 1993)") +
  theme_minimal() +
  theme(plot.title = element_text(size = 24),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 18))

Example 2: Dynamic DiD

General Tips

During Research

  • Organize your data, code, and results as you go
  • Keep things replicable as you go
  • KNOW WHAT YOU’RE DOING
    • Know the difference between know, Know, and KNOW. KNOW everything in your paper.
    • Don’t over-rely on commands; use a few you understand well, and build from there (e.g. feols() in R)
    • LLMs are great when you KNOW what you’re doing.

During Writing

  • Export tables and figures directly from code. No screenshots!
    • The code includes examples using texreg (R) or estout (Stata)
  • Using LaTeX can be less work, and makes it easier to keep things replicable
  • Writing is hard. Start early, revise a lot, get feedback.
  • Clear, simple, concise is better. Good papers say one or two things, clearly and thoroughly.
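
As a small illustration of exporting tables and figures directly from code, here is a minimal sketch using only base R (not texreg or estout, which handle formatting far more completely). The toy regression, file names, and output folder are all hypothetical; tempdir() stands in for whatever output folder a project uses.

```r
# Toy regression on simulated data (hypothetical)
set.seed(6)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)

# Coefficient table: estimates and standard errors
tab <- round(coef(summary(m))[, c("Estimate", "Std. Error")], 3)

out_dir <- tempdir()   # stand-in for a project output folder

# Export the table as CSV
csv_path <- file.path(out_dir, "table1.csv")
write.csv(tab, csv_path)

# Export the same table as a bare-bones LaTeX tabular
rows <- sprintf("%s & %.3f & (%.3f) \\\\", rownames(tab), tab[, 1], tab[, 2])
tex  <- c("\\begin{tabular}{lrr}",
          "Term & Estimate & Std. Error \\\\",
          "\\hline",
          rows,
          "\\end{tabular}")
tex_path <- file.path(out_dir, "table1.tex")
writeLines(tex, tex_path)

# Export a figure to file instead of taking a screenshot
fig_path <- file.path(out_dir, "figure1.pdf")
pdf(fig_path)
plot(x, y)
abline(m)
dev.off()
```

Because the files are regenerated every time the script runs, the tables and figures in the paper cannot drift out of sync with the analysis.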